Consultants: Jeremy Gebrael, Shervin Faeghi, Vince Tafae
Published
May 21, 2025
Executive Summary
pXRF data was provided to complete a Principal Component Analysis (PCA). The purpose of this report is to demonstrate the exemplar workflow that the client may use when analysing their final dataset (independent from the one provided), as an investigation into the development of brick makings in early stages of Australian colonisation as a byproduct of social enterprise (1788 - 1840). The file and subsequent code provided has been created in the RStudio environment using R. Explanatory comments have been added to each code block, which can be viewed using the unfold toggle. The code is intended to be run sequentially; ensure all previous code blocks have been run before trying to run an independent chunk.
Background
We conducted an illustrative PCA as shown below. Interpretation can be found in Analysis. However, we note that the primary motivation of our report is to give the client an idea of how to conduct a PCA. Thus of particular interest to the client is our section “How to conduct a PCA’.
Scientific Problem: The client is investigating brick samples in early colonial Australia (1788–1840). A standard in their field is PCA but the client would like guidance in how to conduct this and what this may look like in R code.
Statistical Approach: We conduct a Principal Component Analysis (PCA) on the sample data but also present a ‘cookbook’ that can guide the client in their actual analysis. Moreover, interpretation of PCA is challenging and thus we briefly interpret the illustrative PCA but provide extra resources for the client.
Data Structure and Cleaning
The provided example dataset comprises 23 independent brick specimens. With the exception of 3 specimens which were only measured 3 times, the remaining 20 were measured with 10 replicates, giving 209 readings in total. The following code is how to read in the data and remove the last 3 specimen with less measurements (for consistency).
Code
# Libraries used for analysis; may need to be downloaded locallylibrary(tidyverse)library(readxl)library(janitor)library(dplyr)library(plotly)data <-read_excel("data.xlsx") #load in data; must be in same folder as .qmddata <-clean_names(data) #makes coding a bit easier with consistent namingcolnames(data) <-gsub("_concentration$", "", colnames(data)) data <- data %>%filter(!is.na(test_no)) #remove NA rowsdata <- data %>%group_by(bricc_standard_name) %>%filter(n() >=10) %>%ungroup() #remove last 3 specimenMDL <- data[data$test_no =="Minimal Detection Limit", ] #extracting all MDL for later# Remove the rows that are calculating sample statistics by making them NA and then removing NAdata$test_no <-as.numeric(data$test_no)data <- data %>%filter(!is.na(test_no))# change variable type, factor basically means categorical variabledata <- data %>%mutate(bricc_standard_name =as.factor(bricc_standard_name))
After some preliminary cleaning, we are left with the following column names:
The instrument used for these measurements is the Olympus Vanta tool. The instrument maintains a minimum range of eligible readings. If a certain reading is less than this range, the measurement reading is returned as less than Limit of Detection, or within the dataset as “<LOD”. This is exemplified in the Uranium (U) column:
Code
head(data$u)
[1] "<LOD" "<LOD" "<LOD" "<LOD" "<LOD" "<LOD"
Based on studies by Hsieh & Fischer (2019), Mitchell et al. (2012) and Xu et al (2019), which used the following data cleaning and imputation for <LOD values. This ensures each replicate within the dataset maintains a numerical value and analytical consistency, without disrupting the overall structure of the data. Hence, our client, Matthew advised the following to deal with these LOD values:
All replicates which are < LOD: impute as 0.
Mixed replicates (some ≥ LOD, some < LOD): impute each < LOD reading with the brick’s Minimal Detection Limit (MDL).
Singleton detection (only one replicate ≥ LOD out of 10): Exclude the brick from further analysis, as a single detection is insufficient, and would be problematic for further analysis to impute MDL to all measurement readings.
Problematic Data, no conclusive numeric value: For example, if reading is >8, impute to 8 (min of range).
Code
head(data$pb)
[1] ">26" ">24" ">27" ">27" ">26" ">25"
Code
data_clean <- data #copy dataframe# Special rule for pb_concentrationif ("pb"%in%colnames(data)) { data$pb <-gsub("^>", "", data$pb)}# 1. Identify key columnsid_cols <-c("bricc_standard_name", "test_no")concentration_cols <-setdiff(names(data), id_cols)# 2. Identify <LOD locations per element column (TRUE if <LOD)is_lod <- data[concentration_cols] =="<LOD"# 3. Replace <LOD with NA and convert all to numericdata_clean[concentration_cols] <-lapply(data[concentration_cols], function(col) {suppressWarnings(as.numeric(replace(col, col =="<LOD", NA)))})# 4. Add MDL info by joining on bricc_standard_nameMDL_renamed <- MDL %>%rename_with(~paste0(., "_MDL"), .cols =-bricc_standard_name)data_with_mdl <-left_join(data_clean, MDL_renamed, by ="bricc_standard_name")# 5. Apply imputation rules for each elementfor (col in concentration_cols) {# Build per-brick summary of non-NA values non_lod_counts <- data_clean %>%group_by(bricc_standard_name) %>%summarise(non_lod =sum(!is.na(.data[[col]])), .groups ="drop")# Create mask of <LOD entries lod_mask <- data[[col]] =="<LOD"for (i inwhich(lod_mask)) { bname <- data_clean$bricc_standard_name[i] n_non_lod <- non_lod_counts$non_lod[non_lod_counts$bricc_standard_name == bname] mdl_val <- data_with_mdl[[paste0(col, "_MDL")]][i]if (length(n_non_lod) ==0||is.na(n_non_lod)) nextif (n_non_lod ==0) {# All values are <LOD for that brick — impute 0 data_clean[[col]][i] <-0 } elseif (n_non_lod ==1) {# Set entire column for that brick to NA data_clean[[col]][data_clean$bricc_standard_name == bname] <-NAnext } else {# More than one — use MDL data_clean[[col]][i] <- mdl_val } }}# Final result: data_clean with imputed values and original columns preserved#FIX; decimal points of MDL (maybe restrict to 2 for consistency), try doing in data_clean
Analysis
Preliminaries
Since the client is mainly interested in the workflow of PCA, we put this as the focus of our report. Briefly, PCA is a statistical method of reducing the dimension of a large dataset (as with the elemental composition of bricks here, which is now n rows by n columns). It does this by finding “principal components” that preserve as much information from the original data as possible. Obviously, some accuracy is sacrificed for interpretability in this sense. The mathematical interpretation can be omitted from a basic understanding. Essentially, these principal components are a combination of the original variables after a linear transformation.
ASSUMPTIONS?
Aggregation
The structure of the dataset reveals a high frequency of censored readings listed as <LOD among the measurement readings for the brick samples. Given the purpose of the investigation is to highlight compositional differences and infer from this the inherent development trends in brick development, it is integral for you to consider your PCA input as the median aggregation. The aggregated median used as the PCA input will ensure the PCA captures each brick’s typical composition, being the central value, rather than possibly being skewed by the imputed, extreme or missing values. Furthermore, when considering the methodological approach of using the handheld XRF measurement tool, they are inherently prone to producing two major errors which affects the reliability of the data: - Surface irregularities: Rough or uneven surfaces of the bricks scatter the X-rays unpredictably, leading to readings of inconsistent and/or extreme values (Mitchell, 2012). - Instrument noise and signal to noise fluctuations: Handheld XRF detectors, such as the Olympus Vanta, are sensitive to both random electronic noises inherent to the device and background noise, especially at lower concentrations (Portaspecs, 2025). Hence, this is a consideration for how your device is collecting data.
As such, the median aggregation allows us to manage these potential errors, this approach: - Centrality of range: Medians are an effective method to deal with extreme values, which would otherwise skew mean-based aggregation. - Actual variability: Expanding on the centrality, inputting the median of the range would reflect each brick’s genuine composition and allow the effective management of the noise caused by extremes.
For instance, when we consider the Nd Concentration for the MNC-1 brick, which is not missing any values, the mean is 301, but the median is 286. Although this difference may seem minor, it illustrates how only a couple outliers can skew the mean distribution. Hence, subtly influencing the results of the PCA. As opposed to the median, which is effectively anchored and not influenced by these outliers. Furthermore, the greater the number of missing values, the more robust and effective capturing the median is. Consider the Nd Concentration readings for the YOK-1 brick sample, with only 4 replicate readings, the median of this is 179, compared to a mean of 195.
Running PCA
The ‘prcomp’ function in R easily computes the PCA for you. It requires a numeric input, and the decision whether to scale or not. This process brings otherwise negligibly small concentrations into consideration for the entire composition of a brick.
The subsequent scatter plot takes the two most explainable principal components (PC1 and PC2), and plots each brick against them. Typically, only two principal components are chosen in this field as it gives the most basic understanding in 2 dimensions. This gives a ‘score’ for each brick, as explained by only two components.
Loadings
Similarly, taking the 2-dimensional model, we can find the loadings. The loadings indicate how much of each original variable (i.e. concentrations) contribute to a principal component (PC1 and PC2 in this case).
A so-called scree plot is also useful to look at. Fittingly, scree refers to the stones on the side of a mountain, decreasing in height from the peak. The plot has a similar look, which tells us how much of the variance in the original data is explained by each principal component. Clearly, PC1 and PC2 account for a large majority of the dataset (>50%), but miss out on a large portion of accuracy. If we wish to have a more explainable model, we would increase to 3-dimensions and repeat previous steps.
Statistical Tests
Find ideal amount of dimensions? (maybe refer to papers)
Clustering
Another goal that was illustrated during the consultation was the implementation of hierarchical clustering. This is a step that follows after running the PCA to group bricks per elemental composition.
The client may have been able to follow along with our analysis via the code comments and analysis but we made this section to explicitly highlight how to analyse their real data. We also highlight some additional resources that may be useful to the client below:
https://youtu.be/FgakZw6K1QQ?si=XK-YdVoCExH_1XN6 (Fun Accessible Youtube Video on PCA)
https://github.com/AbbasPak/Pattern-recognition-by-using-principal-component-analysis-PCA- (useful overview of PCA)
https://shire.science.uq.edu.au/CONS7008/_book/principal-components-analysis.html (A bit more indepth PCA guide but with great visualisations)
Get your data into ‘tidy’ format in which each row is an observation.
We did this for the sample data you presented, but it may be difficult to follow our code even with the comments and we took this into account
Each row should be a brick observation (1,2,3,4… 10)
Each column is a variable (e.g. test_no, brick, elements, qualitative variables)
Each cell is a single value (400, 200, 1, Yes, No, Green, 1902, etc.)
Decide how you will handle observations with <LOD.
According to the papers given, elements that have a lot of <LOD are simply removed from analysis. The decision to impute or remove as implications for the statistic (mean, median etc) chosen. We recommend that you follow the conventions of your field but we conduct the analysis in the way described by the client in post-consultation correspondence (imputing <LOD to zero)
Code
reactable::reactable(data)
Decide how you will handle observations like Pb (Lead) which appears to be a categorical value.
PCA requires numerical values only. So ranges like >50 which is the way lead is encoded cannot be used. The client requested to impute those as the minimal value but this is important to at least note and consider.
Decide upon a statistic for each element to represent each brick.
We have 10 observations per brick. But we aren’t interested scientifically about how each of these observations differ. That is, the bricks themselves are the unit of analysis and the 10 observations are a way to ensure measurement reliability. So, we must decide on a statistic to represent each brick utilizing these 10 observations.
Mean
The mean can be used but, if values are imputed to zero and values >LOD are something like 500 there is large error or skewing of the data.
If you decide to remove all variables with too many
If each brick has a different number of measurements, then you should use a weighted mean (However, if each brick has the same number of samples then you do not need to worry)
Median
The median is robust to outliers and has the added feature that if >50% of the measurements are <LOD (using zero imputation) then the measurement statistic is zero.
We recommend this as it is robust
Create a new data frame (keeping it ‘tidy’) in which each row is a brick
Now that you have decided on a statistic to represent each brick, you want to create a new data frame in which each row is a brick, and each variable is represented by this statistic (the test number is not needed anymore).
Code
reactable::reactable(medians)
Conduct the PCA using the numerical values only
A PCA is conducted on the numerical values only and so you will need to remove the brick names.
Graphically Annotate your PCA with qualitative variables of interest
We weren’t presented with qualitative variables, but the highlighting in the papers given such as the figures below in which the qualitative variable is a kiln sample. This is not part of a PCA, but was flagged something of interest by the client.
Conclusions and Limitations
We are undergraduate students and are time constrained, we hope that our report was still helpful
The sample data (as indicated by the client) is not representative of the real data that they will use, so take the interpretation with a grain of salt.
We assume structure of final dataset will match the sample provided (same column names, naming of variables, etc.)
Basic limitations of PCA apply
References & Acknowledgements
Extensive use of the latest version of Microsoft Copilot (as of May 2025) was used troubleshooting
---title: "Principal Component Analysis Framework for pXRF Data"subtitle: "Consultants: Jeremy Gebrael, Shervin Faeghi, Vince Tafae"author: ""date: "`r Sys.Date()`"format: html: embed-resources: true code-fold: true code-tools: true toc: true code-overflow: scroll theme: sandstone code-block-bg: true code-block-border-left: "dodgerblue"---## Executive SummarypXRF data was provided to complete a Principal Component Analysis (PCA). The purpose of this report is to demonstrate the exemplar workflow that the client may use when analysing their final dataset (independent from the one provided), as an investigation into the development of brick makings in early stages of Australian colonisation as a byproduct of social enterprise (1788 - 1840). The file and subsequent code provided has been created in the RStudio environment using R. Explanatory comments have been added to each code block, which can be viewed using the unfold toggle. The code is intended to be run sequentially; ensure all previous code blocks have been run before trying to run an independent chunk. ## BackgroundWe conducted an illustrative PCA as shown below. Interpretation can be found in Analysis. However, we note that the primary motivation of our report is to give the client an idea of how to conduct a PCA. Thus of particular interest to the client is our section “How to conduct a PCA’.> Scientific Problem: The client is investigating brick samples in early colonial Australia (1788–1840). A standard in their field is PCA but the client would like guidance in how to conduct this and what this may look like in R code.> Statistical Approach: We conduct a Principal Component Analysis (PCA) on the sample data but also present a ‘cookbook’ that can guide the client in their actual analysis. Moreover, interpretation of PCA is challenging and thus we briefly interpret the illustrative PCA but provide extra resources for the client.## Data Structure and CleaningThe provided example dataset comprises 23 independent brick specimens. With the exception of 3 specimens which were only measured 3 times, the remaining 20 were measured with 10 replicates, giving 209 readings in total. The following code is how to read in the data and remove the last 3 specimen with less measurements (for consistency).```{r, message=FALSE, warning=FALSE}# Libraries used for analysis; may need to be downloaded locallylibrary(tidyverse)library(readxl)library(janitor)library(dplyr)library(plotly)data <- read_excel("data.xlsx") #load in data; must be in same folder as .qmddata <- clean_names(data) #makes coding a bit easier with consistent namingcolnames(data) <- gsub("_concentration$", "", colnames(data)) data <- data %>% filter(!is.na(test_no)) #remove NA rowsdata <- data %>% group_by(bricc_standard_name) %>% filter(n() >= 10) %>% ungroup() #remove last 3 specimenMDL <- data[data$test_no == "Minimal Detection Limit", ] #extracting all MDL for later# Remove the rows that are calculating sample statistics by making them NA and then removing NAdata$test_no <- as.numeric(data$test_no)data <- data %>% filter(!is.na(test_no))# change variable type, factor basically means categorical variabledata <- data %>% mutate(bricc_standard_name = as.factor(bricc_standard_name))```After some preliminary cleaning, we are left with the following column names:```{r}colnames(data)```The instrument used for these measurements is the Olympus Vanta tool. The instrument maintains a minimum range of eligible readings. If a certain reading is less than this range, the measurement reading is returned as less than Limit of Detection, or within the dataset as “<LOD”. This is exemplified in the Uranium (U) column:```{r}head(data$u)```Based on studies by Hsieh & Fischer (2019), Mitchell et al. (2012) and Xu et al (2019), which used the following data cleaning and imputation for <LOD values. This ensures each replicate within the dataset maintains a numerical value and analytical consistency, without disrupting the overall structure of the data. Hence, our client, Matthew advised the following to deal with these LOD values:- All replicates which are < LOD: impute as 0.- Mixed replicates (some ≥ LOD, some < LOD): impute each < LOD reading with the brick’s Minimal Detection Limit (MDL). - Singleton detection (only one replicate ≥ LOD out of 10): Exclude the brick from further analysis, as a single detection is insufficient, and would be problematic for further analysis to impute MDL to all measurement readings. - Problematic Data, no conclusive numeric value: For example, if reading is >8, impute to 8 (min of range). ```{r}head(data$pb)``````{r}data_clean <- data #copy dataframe# Special rule for pb_concentrationif ("pb"%in%colnames(data)) { data$pb <-gsub("^>", "", data$pb)}# 1. Identify key columnsid_cols <-c("bricc_standard_name", "test_no")concentration_cols <-setdiff(names(data), id_cols)# 2. Identify <LOD locations per element column (TRUE if <LOD)is_lod <- data[concentration_cols] =="<LOD"# 3. Replace <LOD with NA and convert all to numericdata_clean[concentration_cols] <-lapply(data[concentration_cols], function(col) {suppressWarnings(as.numeric(replace(col, col =="<LOD", NA)))})# 4. Add MDL info by joining on bricc_standard_nameMDL_renamed <- MDL %>%rename_with(~paste0(., "_MDL"), .cols =-bricc_standard_name)data_with_mdl <-left_join(data_clean, MDL_renamed, by ="bricc_standard_name")# 5. Apply imputation rules for each elementfor (col in concentration_cols) {# Build per-brick summary of non-NA values non_lod_counts <- data_clean %>%group_by(bricc_standard_name) %>%summarise(non_lod =sum(!is.na(.data[[col]])), .groups ="drop")# Create mask of <LOD entries lod_mask <- data[[col]] =="<LOD"for (i inwhich(lod_mask)) { bname <- data_clean$bricc_standard_name[i] n_non_lod <- non_lod_counts$non_lod[non_lod_counts$bricc_standard_name == bname] mdl_val <- data_with_mdl[[paste0(col, "_MDL")]][i]if (length(n_non_lod) ==0||is.na(n_non_lod)) nextif (n_non_lod ==0) {# All values are <LOD for that brick — impute 0 data_clean[[col]][i] <-0 } elseif (n_non_lod ==1) {# Set entire column for that brick to NA data_clean[[col]][data_clean$bricc_standard_name == bname] <-NAnext } else {# More than one — use MDL data_clean[[col]][i] <- mdl_val } }}# Final result: data_clean with imputed values and original columns preserved#FIX; decimal points of MDL (maybe restrict to 2 for consistency), try doing in data_clean```## Analysis### PreliminariesSince the client is mainly interested in the workflow of PCA, we put this as the focus of our report. Briefly, PCA is a statistical method of reducing the dimension of a large dataset (as with the elemental composition of bricks here, which is now n rows by n columns). It does this by finding “principal components” that preserve as much information from the original data as possible. Obviously, some accuracy is sacrificed for interpretability in this sense. The mathematical interpretation can be omitted from a basic understanding. Essentially, these principal components are a combination of the original variables after a linear transformation. ASSUMPTIONS?### AggregationThe structure of the dataset reveals a high frequency of censored readings listed as <LOD among the measurement readings for the brick samples. Given the purpose of the investigation is to highlight compositional differences and infer from this the inherent development trends in brick development, it is integral for you to consider your PCA input as the median aggregation. The aggregated median used as the PCA input will ensure the PCA captures each brick’s typical composition, being the central value, rather than possibly being skewed by the imputed, extreme or missing values. Furthermore, when considering the methodological approach of using the handheld XRF measurement tool, they are inherently prone to producing two major errors which affects the reliability of the data:- Surface irregularities: Rough or uneven surfaces of the bricks scatter the X-rays unpredictably, leading to readings of inconsistent and/or extreme values (Mitchell, 2012). - Instrument noise and signal to noise fluctuations: Handheld XRF detectors, such as the Olympus Vanta, are sensitive to both random electronic noises inherent to the device and background noise, especially at lower concentrations (Portaspecs, 2025). Hence, this is a consideration for how your device is collecting data. As such, the median aggregation allows us to manage these potential errors, this approach: - Centrality of range: Medians are an effective method to deal with extreme values, which would otherwise skew mean-based aggregation. - Actual variability: Expanding on the centrality, inputting the median of the range would reflect each brick’s genuine composition and allow the effective management of the noise caused by extremes. For instance, when we consider the Nd Concentration for the MNC-1 brick, which is not missing any values, the mean is 301, but the median is 286. Although this difference may seem minor, it illustrates how only a couple outliers can skew the mean distribution. Hence, subtly influencing the results of the PCA. As opposed to the median, which is effectively anchored and not influenced by these outliers. Furthermore, the greater the number of missing values, the more robust and effective capturing the median is. Consider the Nd Concentration readings for the YOK-1 brick sample, with only 4 replicate readings, the median of this is 179, compared to a mean of 195.### Running PCAThe ‘prcomp’ function in R easily computes the PCA for you. It requires a numeric input, and the decision whether to scale or not. This process brings otherwise negligibly small concentrations into consideration for the entire composition of a brick.```{r}# Step 1: Compute brick-level medians (excluding test_no)medians <- data_clean %>% select(-test_no) %>% group_by(bricc_standard_name) %>% summarise(across(where(is.numeric), ~ median(.x, na.rm = TRUE)), .groups = "drop")# Step 2: Remove constant or all-NA columns (excluding the ID column)pca_input <- medians %>% select(-bricc_standard_name)pca_input_clean <- pca_input[, sapply(pca_input, function(col) { is.numeric(col) && !all(is.na(col)) && sd(col, na.rm = TRUE) > 0})]#rownames(pca_input_clean) <- medians$bricc_standard_name# Step 3: Run PCAresults <- prcomp(pca_input_clean, scale. = TRUE)```### Scores```{r}scores <- as.data.frame(results$x[, 1:2])scores$brick <- medians$bricc_standard_namep1 <- plot_ly( data = scores, x = ~PC1, y = ~PC2, type = "scatter", mode = "markers+text", text = ~brick, textposition = "top center", marker = list(color = 'black'), name = "Bricks") %>% layout( title = "PCA Scores (Bricks)", xaxis = list(title = "PC1"), yaxis = list(title = "PC2") )#Displayp1```The subsequent scatter plot takes the two most explainable principal components (PC1 and PC2), and plots each brick against them. Typically, only two principal components are chosen in this field as it gives the most basic understanding in 2 dimensions. This gives a ‘score’ for each brick, as explained by only two components. ### LoadingsSimilarly, taking the 2-dimensional model, we can find the loadings. The loadings indicate how much of each original variable (i.e. concentrations) contribute to a principal component (PC1 and PC2 in this case). ```{r}loadings <-as.data.frame(results$rotation[, 1:2])loadings$element <-rownames(results$rotation)# Generate arrow shapes; doesnt work yetarrow_shapes <-lapply(1:nrow(loadings), function(i) {list(type ="line",x0 =0, y0 =0,x1 = loadings$PC1[i],y1 = loadings$PC2[i],line =list(color ="red", width =2) )})p2 <-plot_ly(type ="scatter", mode ="text",x = loadings$PC1,y = loadings$PC2,text = loadings$element,textposition ="top center",hoverinfo ="text",showlegend =FALSE) %>%layout(title ="PCA Loadings (Element Concentrations)",xaxis =list(title ="PC1"),yaxis =list(title ="PC2"),shapes = arrow_shapes )#Displayp2```### Scree Plot```{r}factoextra::fviz_eig(results, addlabels =TRUE)```A so-called scree plot is also useful to look at. Fittingly, scree refers to the stones on the side of a mountain, decreasing in height from the peak. The plot has a similar look, which tells us how much of the variance in the original data is explained by each principal component. Clearly, PC1 and PC2 account for a large majority of the dataset (>50%), but miss out on a large portion of accuracy. If we wish to have a more explainable model, we would increase to 3-dimensions and repeat previous steps. ### Statistical TestsFind ideal amount of dimensions? (maybe refer to papers)## ClusteringAnother goal that was illustrated during the consultation was the implementation of hierarchical clustering. This is a step that follows after running the PCA to group bricks per elemental composition. ## How to run a PCA for your real Archeological data::: panel-tabset### Additional ResourcesThe client may have been able to follow along with our analysis via the code comments and analysis but we made this section to explicitly highlight how to analyse their real data. We also highlight some additional resources that may be useful to the client below:- https://youtu.be/FgakZw6K1QQ?si=XK-YdVoCExH_1XN6 (Fun Accessible Youtube Video on PCA)- https://github.com/AbbasPak/Pattern-recognition-by-using-principal-component-analysis-PCA- (useful overview of PCA)- https://shire.science.uq.edu.au/CONS7008/_book/principal-components-analysis.html (A bit more indepth PCA guide but with great visualisations)### Step 1**Get your data into ‘tidy’ format in which each row is an observation.**We did this for the sample data you presented, but it may be difficult to follow our code even with the comments and we took this into account- Each row should be a brick observation (1,2,3,4… 10)- Each column is a variable (e.g. test_no, brick, elements, qualitative variables)- Each cell is a single value (400, 200, 1, Yes, No, Green, 1902, etc.)### Step 2**Decide how you will handle observations with <LOD.**According to the papers given, elements that have a lot of <LOD are simply removed from analysis. The decision to impute or remove as implications for the statistic (mean, median etc) chosen. We recommend that you follow the conventions of your field but we conduct the analysis in the way described by the client in post-consultation correspondence (imputing <LOD to zero)```{r}reactable::reactable(data)```**Decide how you will handle observations like Pb (Lead) which appears to be a categorical value.**PCA requires numerical values only. So ranges like >50 which is the way lead is encoded cannot be used. The client requested to impute those as the minimal value but this is important to at least note and consider.### Step 3**Decide upon a statistic for each element to represent each brick.**We have 10 observations per brick. But we aren’t interested scientifically about how each of these observations differ. That is, the bricks themselves are the unit of analysis and the 10 observations are a way to ensure measurement reliability. So, we must decide on a statistic to represent each brick utilizing these 10 observations.**Mean**- The mean can be used but, if values are imputed to zero and values >LOD are something like 500 there is large error or skewing of the data.- If you decide to remove all variables with too many- If each brick has a different number of measurements, then you should use a weighted mean (However, if each brick has the same number of samples then you do not need to worry)**Median**- The median is robust to outliers and has the added feature that if >50% of the measurements are <LOD (using zero imputation) then the measurement statistic is zero.- We recommend this as it is robust### Step 4**Create a new data frame (keeping it ‘tidy’) in which each row is a brick**Now that you have decided on a statistic to represent each brick, you want to create a new data frame in which each row is a brick, and each variable is represented by this statistic (the test number is not needed anymore).```{r}reactable::reactable(medians)```### Step 5**Conduct the PCA using the numerical values only**A PCA is conducted on the numerical values only and so you will need to remove the brick names.### Step 6**Graphically Annotate your PCA with qualitative variables of interest**We weren’t presented with qualitative variables, but the highlighting in the papers given such as the figures below in which the qualitative variable is a kiln sample. This is not part of a PCA, but was flagged something of interest by the client.::: ## Conclusions and Limitations- We are undergraduate students and are time constrained, we hope that our report was still helpful- The sample data (as indicated by the client) is not representative of the real data that they will use, so take the interpretation with a grain of salt.- We assume structure of final dataset will match the sample provided (same column names, naming of variables, etc.)- Basic limitations of PCA apply## References & Acknowledgements- Extensive use of the latest version of Microsoft Copilot (as of May 2025) was used troubleshooting- https://stats.stackexchange.com/questions/222/what-are-principal-component-scores